Skip to content

feat: update github actions workflow #9

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 8 commits into from

Conversation

rabioinf
Copy link

@rabioinf rabioinf commented Jan 22, 2025

Summary by CodeRabbit

  • New Features

    • Updated GitHub Actions workflows with improved configuration and version upgrades
    • Added comprehensive configuration files for genomic data processing workflow
    • Introduced new Snakemake workflow for genome data analysis
  • Documentation

    • Significantly enhanced README files with detailed workflow instructions
    • Added JSON schemas for configuration validation
    • Improved inline documentation and workflow descriptions
  • Chores

    • Updated dependency versions across multiple configuration files
    • Added new environment configurations for various bioinformatics tools
    • Introduced new workflow scripts and rules

Copy link

coderabbitai bot commented Jan 22, 2025

Walkthrough

This pull request introduces a comprehensive update to a Snakemake-based genomic data processing workflow. The changes span multiple configuration files, GitHub Actions workflows, and workflow scripts. Key modifications include updating workflow configurations, adding new environment definitions, introducing schema validations, and enhancing documentation. The workflow now supports genome data retrieval, read simulation, quality control, adapter trimming, and multi-sample report generation with improved modularity and configuration management.

Changes

File Change Summary
.github/workflows/* Updated workflow configurations, including PR validation, CI/CD triggers, and action versions
.gitignore Added .test/results/* to ignored files
config/ Added comprehensive configuration files for workflow parameters, schemas, and MultiQC settings
README.md Significantly expanded documentation with badges, workflow overview, and execution instructions
workflow/Snakefile Added configuration loading, rule inclusions, and workflow target definition
workflow/envs/ Created environment files for tools like cutadapt, fastqc, multiqc, and genome retrieval
workflow/rules/ Added modular rules for genome processing, read simulation, and quality control
workflow/scripts/get_genome.py New script for genome data retrieval and validation

Sequence Diagram

sequenceDiagram
    participant User
    participant Workflow
    participant NCBI
    participant Genome
    participant Reads
    participant QC
    participant Trimming
    participant Report

    User->>Workflow: Configure parameters
    Workflow->>NCBI: Retrieve genome data
    NCBI-->>Genome: Download FASTA/GFF
    Workflow->>Reads: Simulate sequencing reads
    Reads->>QC: Perform quality check
    QC->>Trimming: Trim adapters
    Trimming->>Report: Generate MultiQC report
    Report-->>User: Workflow complete
Loading

Poem

🐰 A Snakemake tale of genomic delight,
Workflows dancing with scientific might!
From NCBI's realm to reads so bright,
Adapters trimmed, quality shining light,
MultiQC reports, a researcher's delight! 🧬

✨ Finishing Touches
  • 📝 Generate Docstrings (Beta)

Thank you for using CodeRabbit. We offer it for free to the OSS community and would appreciate your support in helping us grow. If you find it useful, would you consider giving us a shout-out on your favorite social media?

❤️ Share
🪧 Tips

Chat

There are 3 ways to chat with CodeRabbit:

  • Review comments: Directly reply to a review comment made by CodeRabbit. Example:
    • I pushed a fix in commit <commit_id>, please review it.
    • Generate unit testing code for this file.
    • Open a follow-up GitHub issue for this discussion.
  • Files and specific lines of code (under the "Files changed" tab): Tag @coderabbitai in a new review comment at the desired location with your query. Examples:
    • @coderabbitai generate unit testing code for this file.
    • @coderabbitai modularize this function.
  • PR comments: Tag @coderabbitai in a new PR comment to ask questions about the PR branch. For the best results, please provide a very specific query, as very limited context is provided in this mode. Examples:
    • @coderabbitai gather interesting stats about this repository and render them as a table. Additionally, render a pie chart showing the language distribution in the codebase.
    • @coderabbitai read src/utils.ts and generate unit testing code.
    • @coderabbitai read the files in the src/scheduler package and generate a class diagram using mermaid and a README in the markdown format.
    • @coderabbitai help me debug CodeRabbit configuration file.

Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments.

CodeRabbit Commands (Invoked using PR comments)

  • @coderabbitai pause to pause the reviews on a PR.
  • @coderabbitai resume to resume the paused reviews.
  • @coderabbitai review to trigger an incremental review. This is useful when automatic reviews are disabled for the repository.
  • @coderabbitai full review to do a full review from scratch and review all the files again.
  • @coderabbitai summary to regenerate the summary of the PR.
  • @coderabbitai generate docstrings to generate docstrings for this PR. (Beta)
  • @coderabbitai resolve resolve all the CodeRabbit review comments.
  • @coderabbitai configuration to show the current CodeRabbit configuration for the repository.
  • @coderabbitai help to get help.

Other keywords and placeholders

  • Add @coderabbitai ignore anywhere in the PR description to prevent this PR from being reviewed.
  • Add @coderabbitai summary to generate the high-level summary at a specific location in the PR description.
  • Add @coderabbitai anywhere in the PR title to generate the title automatically.

CodeRabbit Configuration File (.coderabbit.yaml)

  • You can programmatically configure CodeRabbit by adding a .coderabbit.yaml file to the root of your repository.
  • Please see the configuration documentation for more information.
  • If your editor has YAML language server enabled, you can add the path at the top of this file to enable auto-completion and validation: # yaml-language-server: $schema=https://coderabbit.ai/integrations/schema.v2.json

Documentation and Community

  • Visit our Documentation for detailed information on how to use CodeRabbit.
  • Join our Discord Community to get help, request features, and share feedback.
  • Follow us on X/Twitter for updates and announcements.

Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 6

♻️ Duplicate comments (1)
config/config.yml (1)

8-14: ⚠️ Potential issue

Fix duplicate keys in gff_source_type configuration.

Same issue as in test config - duplicate "RefSeq" keys will cause data loss.

🧹 Nitpick comments (12)
workflow/Snakefile (1)

32-33: Enhance error handling in the onerror hook.

The error message could be more informative by including the error details.

-    print("--- An error occurred! ---")
+    print("--- An error occurred! ---")
+    print(f"Error: {workflow.logger.get_error_message()}")
+    print("Check the log files in the results directory for more details.")
workflow/rules/process_reads.smk (1)

90-90: Consider adjusting thread allocation for cutadapt.

Using 25% of total cores might be excessive for smaller files. Consider setting a maximum limit.

-    threads: int(workflow.cores * 0.25)
+    threads: min(8, max(1, int(workflow.cores * 0.25)))
workflow/scripts/get_genome.py (1)

61-61: Improve code style and readability.

Several style improvements can be made:

  1. Use not in instead of not ... in
  2. Simplify dict key checks
-                if not "Name" in rec_keys:
+                if "Name" not in rec_keys:

-                if not "ID" in rec_keys:
+                if "ID" not in rec_keys:

-        for k in ncbi_genome.keys():
+        for k in ncbi_genome:

Also applies to: 72-72, 107-107

🧰 Tools
🪛 Ruff (0.8.2)

61-61: Test for membership should be not in

Convert to not in

(E713)

.test/config/multiqc_config.yml (1)

2-2: Add newline at end of file.

Add a newline character at the end of the file to comply with YAML best practices.

🧰 Tools
🪛 yamllint (1.35.1)

[error] 2-2: no new line character at the end of file

(new-line-at-end-of-file)

config/schemas/samples.schema.yml (1)

1-25: Consider adding additional schema validations.

The schema is well-structured, but could benefit from additional validations:

  • Pattern validation for sample names (e.g., no spaces/special chars)
  • Minimum value for replicate (e.g., >= 1)
  • Pattern validation for read files (e.g., must end in .fastq.gz)

Example additions:

   sample:
     type: string
     description: sample name/identifier
+    pattern: "^[A-Za-z0-9_-]+$"
   replicate:
     type: number
     default: 1
     description: consecutive numbers representing multiple replicates of one condition
+    minimum: 1
   read1:
     type: string
     description: names of fastq.gz files, read 1
+    pattern: ".*\\.fastq\\.gz$"
   read2:
     type: string
     description: names of fastq.gz files, read 2 (optional)
+    pattern: ".*\\.fastq\\.gz$"
config/schemas/config.schema.yml (3)

8-20: Add type and required field definitions for the get_genome object.

The schema for get_genome should specify its type and required fields for better validation.

   get_genome:
+    type: object
     properties:
       database:
         type: ["string", "null"]
+        enum: ["manual", "ncbi"]
       assembly:
         type: ["string", "null"]
+        pattern: "^GC[F|A]_[0-9]+\\.[0-9]+$"
       fasta:
         type: ["string", "null"]
       gff:
         type: ["string", "null"]
       gff_source_type:
         type: array
+        items:
+          type: string
+    required: ["database"]

21-28: Add constraints for numeric fields in simulate_reads.

The schema for simulate_reads should include minimum/maximum constraints for numeric fields to prevent invalid values.

   simulate_reads:
+    type: object
     properties:
       read_length:
         type: number
+        minimum: 1
+        maximum: 1000
       read_number:
         type: number
+        minimum: 1
       random_freq:
         type: number
+        minimum: 0
+        maximum: 1
+    required: ["read_length", "read_number", "random_freq"]

30-37: Add pattern validation for adapter sequences.

The schema for cutadapt should include pattern validation for adapter sequences to ensure they contain valid nucleotide sequences.

   cutadapt:
+    type: object
     properties:
       threep_adapter:
         type: string
+        pattern: "^-a [ATCG]+$"
       fivep_adapter:
         type: string
+        pattern: "^-A [ATCG]+$"
       default:
         type: array
+        items:
+          type: string
+    required: ["threep_adapter", "fivep_adapter"]
.github/workflows/main.yml (1)

49-51: Consider adding test artifacts retention.

Consider adding artifact retention for test results to help with debugging pipeline failures.

          directory: .test
          snakefile: workflow/Snakefile
          args: "--use-conda --show-failed-logs --cores 3 --conda-cleanup-pkgs cache"
+      - uses: actions/upload-artifact@v4
+        if: always()
+        with:
+          name: test-results
+          path: .test/results
+          retention-days: 14
config/README.md (1)

73-73: Fix code span formatting.

Remove extra spaces inside code spans to comply with Markdown formatting rules.

-[`-q 10 `, `-m 25 `, `-M 100`, `--overlap=5`]
+[`-q 10`, `-m 25`, `-M 100`, `--overlap=5`]
🧰 Tools
🪛 Markdownlint (0.37.0)

73-73: null
Spaces inside code span elements

(MD038, no-space-in-code)


73-73: null
Spaces inside code span elements

(MD038, no-space-in-code)

README.md (1)

117-122: Update list formatting for consistency.

Convert asterisks to dashes for consistent unordered list formatting.

-* Replace `<owner>` and `<repo>` everywhere in the template
-* Replace `<name>` with the workflow name
-* Replace `<description>` with a description
-* Update the workflow description
-* Update the `README.md` badges
+- Replace `<owner>` and `<repo>` everywhere in the template
+- Replace `<name>` with the workflow name
+- Replace `<description>` with a description
+- Update the workflow description
+- Update the `README.md` badges
🧰 Tools
🪛 LanguageTool

[style] ~119-~119: Three successive sentences begin with the same word. Consider rewording the sentence or use a thesaurus to find a synonym.
Context: ...w name (can be the same as <repo>). * Replace <description> with a description of w...

(ENGLISH_WORD_REPEAT_BEGINNING_RULE)

🪛 Markdownlint (0.37.0)

117-117: Expected: dash; Actual: asterisk
Unordered list style

(MD004, ul-style)


118-118: Expected: dash; Actual: asterisk
Unordered list style

(MD004, ul-style)


119-119: Expected: dash; Actual: asterisk
Unordered list style

(MD004, ul-style)


120-120: Expected: dash; Actual: asterisk
Unordered list style

(MD004, ul-style)


121-121: Expected: dash; Actual: asterisk
Unordered list style

(MD004, ul-style)


122-122: Expected: dash; Actual: asterisk
Unordered list style

(MD004, ul-style)

config/multiqc_config.yml (1)

1-2: Add newline at end of file.

Add a newline character at the end of the file to comply with YAML formatting standards.

 remove_sections:
   - samtools-stats
+
🧰 Tools
🪛 yamllint (1.35.1)

[error] 2-2: no new line character at the end of file

(new-line-at-end-of-file)

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between c7a73eb and 7a3a40e.

⛔ Files ignored due to path filters (1)
  • .test/config/samples.tsv is excluded by !**/*.tsv
📒 Files selected for processing (21)
  • .github/workflows/conventional-prs.yml (2 hunks)
  • .github/workflows/main.yml (2 hunks)
  • .github/workflows/release-please.yml (1 hunks)
  • .gitignore (1 hunks)
  • .test/config/config.yml (1 hunks)
  • .test/config/multiqc_config.yml (1 hunks)
  • README.md (1 hunks)
  • config/README.md (1 hunks)
  • config/config.yml (1 hunks)
  • config/multiqc_config.yml (1 hunks)
  • config/schemas/config.schema.yml (1 hunks)
  • config/schemas/samples.schema.yml (1 hunks)
  • workflow/Snakefile (1 hunks)
  • workflow/envs/cutadapt.yml (1 hunks)
  • workflow/envs/fastqc.yml (1 hunks)
  • workflow/envs/get_genome.yml (1 hunks)
  • workflow/envs/multiqc.yml (1 hunks)
  • workflow/envs/simulate_reads.yml (1 hunks)
  • workflow/rules/common.smk (1 hunks)
  • workflow/rules/process_reads.smk (1 hunks)
  • workflow/scripts/get_genome.py (1 hunks)
✅ Files skipped from review due to trivial changes (7)
  • workflow/envs/multiqc.yml
  • workflow/envs/get_genome.yml
  • workflow/envs/fastqc.yml
  • workflow/envs/simulate_reads.yml
  • .gitignore
  • .github/workflows/release-please.yml
  • workflow/envs/cutadapt.yml
🧰 Additional context used
🪛 yamllint (1.35.1)
.test/config/multiqc_config.yml

[error] 2-2: no new line character at the end of file

(new-line-at-end-of-file)

config/multiqc_config.yml

[error] 2-2: no new line character at the end of file

(new-line-at-end-of-file)

🪛 LanguageTool
config/README.md

[style] ~81-~81: Three successive sentences begin with the same word. Consider rewording the sentence or use a thesaurus to find a synonym.
Context: ...w name (can be the same as <repo>). * Replace <description> with a description of w...

(ENGLISH_WORD_REPEAT_BEGINNING_RULE)

README.md

[uncategorized] ~53-~53: You might be missing the article “the” here.
Context: ...### Execution To run the workflow from command line, change the working directory. ``...

(AI_EN_LECTOR_MISSING_DETERMINER_THE)


[style] ~119-~119: Three successive sentences begin with the same word. Consider rewording the sentence or use a thesaurus to find a synonym.
Context: ...w name (can be the same as <repo>). * Replace <description> with a description of w...

(ENGLISH_WORD_REPEAT_BEGINNING_RULE)

🪛 Markdownlint (0.37.0)
config/README.md

73-73: null
Spaces inside code span elements

(MD038, no-space-in-code)


73-73: null
Spaces inside code span elements

(MD038, no-space-in-code)

README.md

117-117: Expected: dash; Actual: asterisk
Unordered list style

(MD004, ul-style)


118-118: Expected: dash; Actual: asterisk
Unordered list style

(MD004, ul-style)


119-119: Expected: dash; Actual: asterisk
Unordered list style

(MD004, ul-style)


120-120: Expected: dash; Actual: asterisk
Unordered list style

(MD004, ul-style)


121-121: Expected: dash; Actual: asterisk
Unordered list style

(MD004, ul-style)


113-113: null
Bare URL used

(MD034, no-bare-urls)


100-100: null
Spaces inside code span elements

(MD038, no-space-in-code)


100-100: null
Spaces inside code span elements

(MD038, no-space-in-code)

🪛 Ruff (0.8.2)
workflow/scripts/get_genome.py

17-17: Undefined name snakemake

(F821)


18-18: Undefined name snakemake

(F821)


19-19: Undefined name snakemake

(F821)


20-20: Undefined name snakemake

(F821)


21-21: Undefined name snakemake

(F821)


22-22: Undefined name snakemake

(F821)


23-23: Undefined name snakemake

(F821)


24-24: Undefined name snakemake

(F821)


29-29: Do not use mutable data structures for argument defaults

Replace with None; initialize within function

(B006)


29-29: Do not use mutable data structures for argument defaults

Replace with None; initialize within function

(B006)


41-41: Do not use mutable data structures for argument defaults

Replace with None; initialize within function

(B006)


41-41: Do not use mutable data structures for argument defaults

Replace with None; initialize within function

(B006)


47-47: f-string without any placeholders

Remove extraneous f prefix

(F541)


55-55: Undefined name snakemake

(F821)


61-61: Test for membership should be not in

Convert to not in

(E713)


72-72: Test for membership should be not in

Convert to not in

(E713)


107-107: Use key in dict instead of key in dict.keys()

Remove .keys()

(SIM118)

🪛 actionlint (1.7.4)
.github/workflows/main.yml

29-29: the runner of "actions/checkout@v2" action is too old to run on GitHub Actions. update the action's version to fix this issue

(action)

🔇 Additional comments (6)
workflow/rules/common.smk (1)

16-17: 🛠️ Refactor suggestion

Verify schema paths and add error handling.

The schema paths are using relative paths (../../config/schemas/). This could be problematic if the working directory changes. Consider:

  1. Using workflow.source_path() for reliable path resolution
  2. Adding error handling for missing schema files
-validate(samples, schema="../../config/schemas/samples.schema.yml")
-validate(config, schema="../../config/schemas/config.schema.yml")
+validate(samples, schema=workflow.source_path("../config/schemas/samples.schema.yml"))
+validate(config, schema=workflow.source_path("../config/schemas/config.schema.yml"))
.test/config/multiqc_config.yml (1)

1-2: LGTM! Configuration looks valid.

The removal of samtools-stats section from the MultiQC report is properly configured.

🧰 Tools
🪛 yamllint (1.35.1)

[error] 2-2: no new line character at the end of file

(new-line-at-end-of-file)

.github/workflows/conventional-prs.yml (1)

Line range hint 1-20: LGTM! Workflow updates follow best practices.

Good improvements:

  • Explicit permissions following principle of least privilege
  • Updated action version for security
  • Clear job naming
config/config.yml (1)

1-27: Verify path consistency between test and production configs.

The paths differ between test and production configs:

  • Test config: config/samples.tsv
  • Production config: .test/config/samples.tsv
  • Test config: config/multiqc_config.yml
  • Production config: config/multiqc_config.yml

Please verify the correct paths and ensure consistency.

.github/workflows/main.yml (1)

40-43: Add missing job dependencies.

The Testing job should depend on both Linting and Formatting jobs to ensure proper execution order.

config/README.md (1)

3-82: Complete the workflow description placeholders.

Several placeholders need to be replaced with actual content:

  • <detailed description> on line 3
  • <name> in the title
🧰 Tools
🪛 LanguageTool

[style] ~81-~81: Three successive sentences begin with the same word. Consider rewording the sentence or use a thesaurus to find a synonym.
Context: ...w name (can be the same as <repo>). * Replace <description> with a description of w...

(ENGLISH_WORD_REPEAT_BEGINNING_RULE)

🪛 Markdownlint (0.37.0)

73-73: null
Spaces inside code span elements

(MD038, no-space-in-code)


73-73: null
Spaces inside code span elements

(MD038, no-space-in-code)

Comment on lines +92 to +93
"cutadapt {params.threep_adapter} {params.fivep_adapter} --cores {threads} "
"-o {output.fastq1} -p {output.fastq2} {input.fastq1} {input.fastq1} > {log.stdout} 2> {log.stderr}"
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue

Fix incorrect input file reference in cutadapt command.

The second input file reference is incorrect, causing the same file to be processed twice.

-        "-o {output.fastq1} -p {output.fastq2} {input.fastq1} {input.fastq1} > {log.stdout} 2> {log.stderr}"
+        "-o {output.fastq1} -p {output.fastq2} {input.fastq1} {input.fastq2} > {log.stdout} 2> {log.stderr}"
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
"cutadapt {params.threep_adapter} {params.fivep_adapter} --cores {threads} "
"-o {output.fastq1} -p {output.fastq2} {input.fastq1} {input.fastq1} > {log.stdout} 2> {log.stderr}"
"cutadapt {params.threep_adapter} {params.fivep_adapter} --cores {threads} "
"-o {output.fastq1} -p {output.fastq2} {input.fastq1} {input.fastq2} > {log.stdout} 2> {log.stderr}"

Comment on lines +29 to +39
def check_fasta(input_fasta, log=[], error=[]):
with open(input_fasta, "r") as fasta_file:
fasta = fasta_file.read()
n_items = fasta.count(">")
if n_items:
log += [f"Supplied fasta file '{input_fasta}' was found"]
log += [f"Supplied fasta file contains {n_items} items"]
else:
error += ["The supplied fasta file contains no valid entries starting with '>'"]
return fasta, log, error

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue

Fix mutable default arguments and improve code style.

The functions use mutable default arguments which can lead to unexpected behavior.

-def check_fasta(input_fasta, log=[], error=[]):
+def check_fasta(input_fasta, log=None, error=None):
+    log = [] if log is None else log
+    error = [] if error is None else error

-def check_gff(input_gff, log=[], error=[]):
+def check_gff(input_gff, log=None, error=None):
+    log = [] if log is None else log
+    error = [] if error is None else error

Also applies to: 41-82

🧰 Tools
🪛 Ruff (0.8.2)

29-29: Do not use mutable data structures for argument defaults

Replace with None; initialize within function

(B006)


29-29: Do not use mutable data structures for argument defaults

Replace with None; initialize within function

(B006)

Comment on lines +17 to +24
input_database = snakemake.params["database"]
input_assembly = snakemake.params["assembly"]
input_fasta = snakemake.params["fasta"]
input_gff = snakemake.params["gff"]
output_path = snakemake.output["path"]
output_fasta = snakemake.output["fasta"]
output_gff = snakemake.output["gff"]
output_log = snakemake.log["path"]
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🛠️ Refactor suggestion

Add snakemake import and type hints.

The script uses the snakemake variable without importing it. Also, consider adding type hints for better code maintainability.

+from snakemake.script import Snakemake
+
+# Type hint for the snakemake object
+snakemake: Snakemake
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
input_database = snakemake.params["database"]
input_assembly = snakemake.params["assembly"]
input_fasta = snakemake.params["fasta"]
input_gff = snakemake.params["gff"]
output_path = snakemake.output["path"]
output_fasta = snakemake.output["fasta"]
output_gff = snakemake.output["gff"]
output_log = snakemake.log["path"]
from snakemake.script import Snakemake
# Type hint for the snakemake object
snakemake: Snakemake
input_database = snakemake.params["database"]
input_assembly = snakemake.params["assembly"]
input_fasta = snakemake.params["fasta"]
input_gff = snakemake.params["gff"]
output_path = snakemake.output["path"]
output_fasta = snakemake.output["fasta"]
output_gff = snakemake.output["gff"]
output_log = snakemake.log["path"]
🧰 Tools
🪛 Ruff (0.8.2)

17-17: Undefined name snakemake

(F821)


18-18: Undefined name snakemake

(F821)


19-19: Undefined name snakemake

(F821)


20-20: Undefined name snakemake

(F821)


21-21: Undefined name snakemake

(F821)


22-22: Undefined name snakemake

(F821)


23-23: Undefined name snakemake

(F821)


24-24: Undefined name snakemake

(F821)

Comment on lines +85 to +131
if input_database.lower() == "ncbi":
ncbi_result = getoutput(
f"datasets summary genome accession {input_assembly} --as-json-lines | "
+ "dataformat tsv genome --fields accession,annotinfo-release-date,organism-name"
)
if ncbi_result.startswith("Error"):
error += [ncbi_result]
error += [
"The supplied refseq/genbank ID was not valid. Example for correct input: 'GCF_000009045.1'"
]
elif len(ncbi_result) == 0:
error += [
"The result from fetching NCBI genome data has zero length. Please check your internet connection!"
]
else:
ncbi_genome = [
i.split("\t")
for i in ncbi_result.split("\n")
if not (i.startswith("New version") or i.startswith("Warning"))
]
ncbi_genome = dict(zip(ncbi_genome[0], ncbi_genome[1]))
log += ["Found the following genome(s):"]
for k in ncbi_genome.keys():
log += ["{0}: {1}".format(k, ncbi_genome.get(k))]
refseq_id = ncbi_genome.get("Assembly Accession")
if not refseq_id.startswith("GCF_"):
error += ["The RefSeq ID '{0}' has no valid format.".format(refseq_id)]
ncbi_command = (
f"datasets download genome accession {refseq_id}"
+ f" --filename {output_path}/database.zip --include genome,gff3; "
+ f"cd {output_path}; unzip database.zip; rm database.zip"
)
copy_command = (
f"cp {output_path}/ncbi_dataset/data/{refseq_id}/*.fna {output_fasta}; "
+ f"cp {output_path}/ncbi_dataset/data/{refseq_id}/genomic.gff {output_gff}"
)
index_command = f"samtools faidx {output_fasta}"
str_out = getoutput(ncbi_command)
str_cp = getoutput(copy_command)
# import and check files
fasta, log, error = check_fasta(output_fasta, log, error)
index_out = getoutput(index_command)
gff, log, error = check_gff(output_gff, log, error)
# write cleaned gff file
with open(output_gff, "w") as gff_out:
GFF.write(gff, gff_out)

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🛠️ Refactor suggestion

Add timeout handling for NCBI API calls.

The script should handle potential timeouts when fetching data from NCBI.

+import requests.exceptions
+from subprocess import getoutput, run, PIPE, TimeoutError

+def execute_with_timeout(command, timeout=300):
+    try:
+        result = run(command, shell=True, stdout=PIPE, stderr=PIPE, timeout=timeout, text=True)
+        return result.stdout
+    except TimeoutError:
+        return "Error: Command timed out after {} seconds".format(timeout)

-    ncbi_result = getoutput(
+    ncbi_result = execute_with_timeout(
         f"datasets summary genome accession {input_assembly} --as-json-lines | "
         + "dataformat tsv genome --fields accession,annotinfo-release-date,organism-name"
     )

Committable suggestion skipped: line range outside the PR's diff.

🧰 Tools
🪛 Ruff (0.8.2)

107-107: Use key in dict instead of key in dict.keys()

Remove .keys()

(SIM118)

Comment on lines +8 to +14
gff_source_type:
[
"RefSeq": "gene",
"RefSeq": "pseudogene",
"RefSeq": "CDS",
"Protein Homology": "CDS",
]
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue

Fix duplicate keys in gff_source_type configuration.

The current structure has duplicate "RefSeq" keys which will cause only the last entry to be used. This means the "gene" and "pseudogene" entries will be lost.

Restructure the configuration to preserve all entries:

   gff_source_type:
     [
-      "RefSeq": "gene",
-      "RefSeq": "pseudogene",
-      "RefSeq": "CDS",
+      {"source": "RefSeq", "type": "gene"},
+      {"source": "RefSeq", "type": "pseudogene"},
+      {"source": "RefSeq", "type": "CDS"},
       "Protein Homology": "CDS",
     ]
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
gff_source_type:
[
"RefSeq": "gene",
"RefSeq": "pseudogene",
"RefSeq": "CDS",
"Protein Homology": "CDS",
]
gff_source_type:
[
{"source": "RefSeq", "type": "gene"},
{"source": "RefSeq", "type": "pseudogene"},
{"source": "RefSeq", "type": "CDS"},
"Protein Homology": "CDS",
]

directory: .
snakefile: workflow/Snakefile
args: "--lint"
- uses: actions/checkout@v2
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue

Update the checkout action version.

The runner of "actions/checkout@v2" action is outdated. Update to v4 for consistency with other jobs.

-      - uses: actions/checkout@v2
+      - uses: actions/checkout@v4
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
- uses: actions/checkout@v2
- uses: actions/checkout@v4
🧰 Tools
🪛 actionlint (1.7.4)

29-29: the runner of "actions/checkout@v2" action is too old to run on GitHub Actions. update the action's version to fix this issue

(action)

@rabioinf rabioinf closed this Jan 22, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants